Estimating Concrete Compressive Strength from Use Materials and the Concrete Age (ECCS from UM and CA)
Abstract
This project uses Prof. I-Cheng Yeh's dataset (Concrete Data.xls), which contains more than 1000 observations of nine variables. Concrete Compressive Strength (MPa) is related to Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, and the age of the samples. The goal is to find an acceptable model that gives an equation for estimating Concrete Compressive Strength from the eight independent variables in the dataset. The dataset is scaled. Subset Selection Methods and Stepwise Regression both lead to using all regressors, which gives a higher Adjusted R squared and a lower Root Mean Square Error. A third-order Multivariate Polynomial Regression with eight regressors is fitted to the 1005 observations that are not duplicated. The fitted model is significant, with an adjusted R squared of 0.909 and an F-statistic of 62.17.
2. Introduction
Concrete Compressive Strength is one of the main tests performed on concrete samples. The test, which compresses concrete samples for education, research, or quality measurement, determines how strong hardened concrete samples are. The water-to-cement ratio is known to raise or lower the strength of concrete, so choosing an appropriate ratio is essential for making concrete with high compressive strength. Every decrease in the water-to-cement weight ratio leads to an increase in the compressive strength of concrete. However, a low ratio of water weight to cement weight causes poor workability, which means the concrete mixture is hard to mix. Plasticizers are used to improve the workability of a concrete mixture that is meant to contain less water in order to raise its compressive strength. Fly ash (Class F) and slag cement are used in concrete mixtures to protect the concrete from Alkali-Silica Reaction and sulfate attack (IS-11).
The dataset used in this project was donated by Prof. I-Cheng Yeh. It describes the components of concrete and contains nine variables with 1030 records each. The eight regressors (independent variables) are Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, and Fine Aggregate (each in kg in a m3 mixture), and Age (1 ~ 365 days). These quantitative independent variables influence concrete compressive strength (MPa). This is an opportunity to find a good model for the dataset and to obtain an equation from which the researcher can learn how the materials used and the age of the concrete (the number of days since the samples were made) determine the strength of the concrete samples.
The researcher seeks answers to these questions: 1- Which regressors should be included in the model? 2- What type of model fits the dataset well?
a. Data
I know that the ratio of water to cement has an impact on the compressive strength of concrete.
My motivation is to learn the effect of each independent variable in this model, including water and cement, and how the regressor coefficients affect concrete compressive strength.
1- Cement is the main element of concrete. Its job is to glue the other concrete components, such as sand and aggregate, together after mixing with water.
2- Fine aggregates are aggregates that pass through a 4.76 mm sieve.
3- The remaining aggregates, which do not pass through the sieve, are called Coarse Aggregate.
4- Water needs no definition.
5- Super-plasticizers are used to maintain high workability while at the same time maintaining strength (Ruwan Rajapakse).
6- "Fly ash is the finely divided residue that results from the combustion of pulverized coal and is transported from the combustion chamber by exhaust gases" (Fly Ash Facts for Highway Engineers).
7- "Blast Furnace Slag is formed when iron ore or iron pellets, coke and a flux (either limestone or dolomite) are melted together in a blast furnace. When the metallurgical smelting process is complete, the lime in the flux has been chemically combined with the aluminates and silicates of the ore and coke ash to form a non-metallic product called blast furnace slag" (National Slag Association).
8- Age of concrete: the number of days since the sample was made.
X1 : Cement (kg in a m3 mixture)
X2 : Blast Furnace Slag (kg in a m3 mixture)
X3 : Fly Ash (kg in a m3 mixture)
X4 : Water (kg in a m3 mixture)
X5 : Superplasticizer (kg in a m3 mixture)
X6 : Coarse Aggregate (kg in a m3 mixture)
X7 : Fine Aggregate (kg in a m3 mixture)
X8 : Age (Day (1~365))
Y : Concrete compressive strength (MPa)
The scatterplot of X1 vs X8 suggests a possible curved relationship between X1 and X8, because of the points lying between X8 = 2 and X8 = 5.
Call:
lm(formula = X1 ~ X8, data = df_new)
Residuals:
Min 1Q Median 3Q Max
-1.7580 -0.8266 -0.1605 0.6146 2.5629
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.742e-16 3.144e-02 0.000 1.00000
X8 8.635e-02 3.146e-02 2.745 0.00616 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9968 on 1003 degrees of freedom
Multiple R-squared: 0.007456, Adjusted R-squared: 0.006466
F-statistic: 7.534 on 1 and 1003 DF, p-value: 0.006161
Call:
lm(formula = X1 ~ poly(X8, degree = 3, raw = F), data = df_new)
Residuals:
Min 1Q Median 3Q Max
-1.8372 -0.8332 -0.1308 0.6053 2.5795
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.551e-16 3.121e-02 0.000 1.00000
poly(X8, degree = 3, raw = F)1 2.736e+00 9.893e-01 2.766 0.00579 **
poly(X8, degree = 3, raw = F)2 8.601e-01 9.893e-01 0.869 0.38485
poly(X8, degree = 3, raw = F)3 -4.006e+00 9.893e-01 -4.049 5.54e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9893 on 1001 degrees of freedom
Multiple R-squared: 0.02417, Adjusted R-squared: 0.02125
F-statistic: 8.266 on 3 and 1001 DF, p-value: 1.956e-05
The scatterplot of X5 vs X1 suggests a possible curved relationship between X5 and X1.
Call:
lm(formula = X5 ~ X1, data = df_new)
Residuals:
Min 1Q Median 3Q Max
-1.1717 -0.9954 0.0417 0.6944 4.3090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.595e-16 3.150e-02 0.000 1.0000
X1 6.091e-02 3.152e-02 1.932 0.0536 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9986 on 1003 degrees of freedom
Multiple R-squared: 0.003709, Adjusted R-squared: 0.002716
F-statistic: 3.734 on 1 and 1003 DF, p-value: 0.05358
Call:
lm(formula = X5 ~ poly(X1, degree = 3, raw = F), data = df_new)
Residuals:
Min 1Q Median 3Q Max
-1.2049 -0.9705 0.0642 0.6993 4.2449
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.484e-16 3.147e-02 0.000 1.0000
poly(X1, degree = 3, raw = F)1 1.930e+00 9.976e-01 1.935 0.0533 .
poly(X1, degree = 3, raw = F)2 7.557e-01 9.976e-01 0.758 0.4489
poly(X1, degree = 3, raw = F)3 -1.883e+00 9.976e-01 -1.887 0.0594 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9976 on 1001 degrees of freedom
Multiple R-squared: 0.007809, Adjusted R-squared: 0.004835
F-statistic: 2.626 on 3 and 1001 DF, p-value: 0.04918
Call:
lm(formula = Y ~ X8, data = df_new)
Residuals:
Min 1Q Median 3Q Max
-2.31385 -0.67123 -0.08051 0.55588 2.94992
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.147e-16 2.971e-02 0.00 1
X8 3.374e-01 2.972e-02 11.35 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9418 on 1003 degrees of freedom
Multiple R-squared: 0.1138, Adjusted R-squared: 0.1129
F-statistic: 128.8 on 1 and 1003 DF, p-value: < 2.2e-16
Call:
lm(formula = Y ~ poly(X8, degree = 3, raw = F), data = df_new)
Residuals:
Min 1Q Median 3Q Max
-1.7943 -0.5954 -0.1437 0.4930 2.8486
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.416e-16 2.599e-02 0.000 1
poly(X8, degree = 3, raw = F)1 1.069e+01 8.239e-01 12.975 <2e-16 ***
poly(X8, degree = 3, raw = F)2 -1.217e+01 8.239e-01 -14.772 <2e-16 ***
poly(X8, degree = 3, raw = F)3 7.888e+00 8.239e-01 9.574 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8239 on 1001 degrees of freedom
Multiple R-squared: 0.3233, Adjusted R-squared: 0.3213
F-statistic: 159.4 on 3 and 1001 DF, p-value: < 2.2e-16
Call:
lm(formula = X1 ~ X8, data = df_new)
Residuals:
Min 1Q Median 3Q Max
-1.7580 -0.8266 -0.1605 0.6146 2.5629
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.742e-16 3.144e-02 0.000 1.00000
X8 8.635e-02 3.146e-02 2.745 0.00616 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9968 on 1003 degrees of freedom
Multiple R-squared: 0.007456, Adjusted R-squared: 0.006466
F-statistic: 7.534 on 1 and 1003 DF, p-value: 0.006161
Call:
lm(formula = X1 ~ poly(X8, degree = 3, raw = F), data = df_new)
Residuals:
Min 1Q Median 3Q Max
-1.8372 -0.8332 -0.1308 0.6053 2.5795
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.551e-16 3.121e-02 0.000 1.00000
poly(X8, degree = 3, raw = F)1 2.736e+00 9.893e-01 2.766 0.00579 **
poly(X8, degree = 3, raw = F)2 8.601e-01 9.893e-01 0.869 0.38485
poly(X8, degree = 3, raw = F)3 -4.006e+00 9.893e-01 -4.049 5.54e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9893 on 1001 degrees of freedom
Multiple R-squared: 0.02417, Adjusted R-squared: 0.02125
F-statistic: 8.266 on 3 and 1001 DF, p-value: 1.956e-05
Call:
lm(formula = X5 ~ X8, data = df_new)
Residuals:
Min 1Q Median 3Q Max
-1.1557 -1.0735 -0.0254 0.6217 4.5576
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.908e-16 3.096e-02 0.000 1
X8 -1.941e-01 3.098e-02 -6.266 5.51e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9815 on 1003 degrees of freedom
Multiple R-squared: 0.03767, Adjusted R-squared: 0.03671
F-statistic: 39.26 on 1 and 1003 DF, p-value: 5.505e-10
Call:
lm(formula = X5 ~ poly(X8, degree = 3, raw = F), data = df_new)
Residuals:
Min 1Q Median 3Q Max
-1.2462 -0.9436 -0.0146 0.5903 4.5494
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.990e-16 3.027e-02 0.000 1.00000
poly(X8, degree = 3, raw = F)1 -6.149e+00 9.595e-01 -6.409 2.25e-10 ***
poly(X8, degree = 3, raw = F)2 -2.924e+00 9.595e-01 -3.047 0.00237 **
poly(X8, degree = 3, raw = F)3 6.010e+00 9.595e-01 6.263 5.58e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9595 on 1001 degrees of freedom
Multiple R-squared: 0.08215, Adjusted R-squared: 0.0794
F-statistic: 29.86 on 3 and 1001 DF, p-value: < 2.2e-16
Data and Analytic Methods
b. Data management
c. Analytic Methods
The correlation matrix of the regressors shows that X5 and X4 have a relatively high correlation of -0.64694605, which indicates a possible collinearity issue.
X1 X2 X3 X4 X5
X1 1.00000000 -0.30332393 -0.38560975 -0.05662462 0.06090565
X2 -0.30332393 1.00000000 -0.31235235 0.13026211 0.01980022
X3 -0.38560975 -0.31235235 1.00000000 -0.28331366 0.41421276
X4 -0.05662462 0.13026211 -0.28331366 1.00000000 -0.64694605
X5 0.06090565 0.01980022 0.41421276 -0.64694605 1.00000000
X6 -0.08620530 -0.27755887 -0.02646847 -0.21247975 -0.24172051
X7 -0.24537545 -0.28968490 0.09026178 -0.44491471 0.20799292
X8 0.08634782 -0.04275925 -0.15894045 0.27928351 -0.19407633
X6 X7 X8
X1 -0.086205301 -0.24537545 0.086347817
X2 -0.277558868 -0.28968490 -0.042759251
X3 -0.026468472 0.09026178 -0.158940454
X4 -0.212479752 -0.44491471 0.279283512
X5 -0.241720507 0.20799292 -0.194076334
X6 1.000000000 -0.16218697 -0.005263596
X7 -0.162186974 1.00000000 -0.156572492
X8 -0.005263596 -0.15657249 1.000000000
Stepwise Regression suggests keeping all 8 regressors in the model, because the 8-regressor model has the lowest Root Mean Square Error (RMSE = 0.6339170), the highest coefficient of determination (R-squared = 0.5973950), and the lowest Mean Absolute Error (MAE = 0.5039133).
nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 0.8704813 0.2458884 0.7151686 0.07288146 0.08200000 0.05992693
2 2 0.8129049 0.3413771 0.6651493 0.06214243 0.07415486 0.05382460
3 3 0.7260136 0.4738393 0.5798846 0.06588871 0.07705157 0.05824140
4 4 0.6783065 0.5387098 0.5483010 0.05508152 0.08051937 0.04727427
5 5 0.6358125 0.5950898 0.5071442 0.04549369 0.05452078 0.04171870
6 6 0.7564071 0.4281821 0.6111919 0.04863825 0.06022739 0.04317181
7 7 0.6986066 0.5099107 0.5576974 0.08976091 0.09666554 0.07866024
8 8 0.6339170 0.5973950 0.5039133 0.04596748 0.05467927 0.04467015
Once the model size is chosen from the R-squared and RMSE of the Stepwise Regression, the subset selection output shows which variables to include in a model of that size.
Subset selection object
Call: regsubsets.formula(Y ~ ., data = df_new, nvmax = 8)
8 Variables (and intercept)
Forced in Forced out
X1 FALSE FALSE
X2 FALSE FALSE
X3 FALSE FALSE
X4 FALSE FALSE
X5 FALSE FALSE
X6 FALSE FALSE
X7 FALSE FALSE
X8 FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
X1 X2 X3 X4 X5 X6 X7 X8
1 ( 1 ) "*" " " " " " " " " " " " " " "
2 ( 1 ) "*" " " " " " " "*" " " " " " "
3 ( 1 ) "*" " " " " " " "*" " " " " "*"
4 ( 1 ) "*" "*" " " "*" " " " " " " "*"
5 ( 1 ) "*" "*" "*" "*" " " " " " " "*"
6 ( 1 ) "*" "*" "*" "*" "*" " " " " "*"
7 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*"
8 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
Adjusted R squared
This section reports the Adjusted R squared of the third-order Multivariate Polynomial Regression with eight regressors. The Adjusted R squared is about 0.91, so about 91% of the total variation in Concrete compressive strength (MPa) can be explained by the third-order polynomial model using the eight regressors.
[1] "Adjusted R squared of the third-order Multivariate Polynomial Regression with 8 regressors:"
[1] 0.9090186
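As a check, this value can be recovered, up to rounding of the printed figures, from the sums of squares in the analysis of variance table for this model (model sum of squares 927.58 on 164 degrees of freedom, residual sum of squares 76.42 on 840 degrees of freedom). Here \(n = 1005\) is the number of observations and \(p = 164\) the number of polynomial terms; the symbols \(SS_{\text{res}}\) and \(SS_{\text{tot}}\) are notation introduced only for this check.

\[
R^2_{\text{adj}} = 1 - \frac{SS_{\text{res}}/(n-p-1)}{SS_{\text{tot}}/(n-1)} = 1 - \frac{76.42/840}{(927.58+76.42)/1004} \approx 0.909
\]

The total sum of squares equals \(n - 1 = 1004\) because Y was scaled to unit variance.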
\(H_0: \beta_i=0,\ i=1,2,\dots,164\) versus \(H_1:\) at least one \(\beta_i\neq0\), where the \(\beta_i\) are the coefficients of the 164 polynomial terms.
Analysis of Variance Table
Response: Y
Df Sum Sq
polym(X1, X2, X3, X4, X5, X6, X7, X8, degree = 3, raw = F) 164 927.58
Residuals 840 76.42
Mean Sq F value
polym(X1, X2, X3, X4, X5, X6, X7, X8, degree = 3, raw = F) 5.6559 62.166
Residuals 0.0910
Pr(>F)
polym(X1, X2, X3, X4, X5, X6, X7, X8, degree = 3, raw = F) < 2.2e-16 ***
Residuals
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
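The F-statistic in the table is the ratio of the two mean squares, each being a sum of squares divided by its degrees of freedom. Using the rounded values printed in the table (the symbols \(MS_{\text{model}}\) and \(MS_{\text{res}}\) are notation introduced only for this check):

\[
F = \frac{MS_{\text{model}}}{MS_{\text{res}}} = \frac{927.58/164}{76.42/840} \approx \frac{5.6559}{0.0910} \approx 62.2
\]

This agrees with the reported value of 62.166 up to rounding.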
1. Linearity Assumption
We check the linearity assumption with the residuals vs fitted values plot. Most of the blue points lie between -0.5 and 0.5 on the residual axis and between -2 and 2 on the fitted value axis. There is no particular pattern between the residuals and the fitted values, so the linearity assumption appears reasonable.
2. Normality Assumption
The Equal Variance Assumption can be checked with the Scale-Location plot in Diagnostics 2. In the Scale-Location plot, the residuals are randomly spread points with no pattern, which indicates that the equal variance assumption is not violated. For normality, the histogram of the residuals should be examined to decide whether the blue points that do not fall on the orange line can be tolerated, and whether the normality assumption is reasonable.
3. Histogram of fitted residuals The residuals appear to follow the normal distribution because their distribution is unimodal and approximately symmetric.
4. Equal Variance Based on the Scale-Location plot, the residuals are randomly spread points with no pattern. This indicates that the equal variance assumption is not violated.
5. Residuals vs Leverage No point shows problematic leverage because all points fall below a leverage of 1.0.
5. Results
The questions the researcher set out to answer:
1- Which regressors should be included in the model?
According to Stepwise Regression and Subset Selection, using all regressors is recommended: the materials used [Cement (X1), Blast Furnace Slag (X2), Fly Ash (X3), Water (X4), Superplasticizer (X5), Coarse Aggregate (X6), Fine Aggregate (X7)] and the concrete age [Age (X8)]. This model has the lowest Root Mean Square Error (RMSE = 0.6339170), the highest coefficient of determination (R-squared = 0.5973950), and the lowest Mean Absolute Error (MAE = 0.5039133).
2- What type of model fits the dataset well?
According to the diagnostics of the third-order Multivariate Polynomial Regression with 8 regressors, there is no issue with using this model to estimate Concrete compressive strength (MPa) from Cement (kg in a m3 mixture), Blast Furnace Slag (kg in a m3 mixture), Fly Ash (kg in a m3 mixture), Water (kg in a m3 mixture), Superplasticizer (kg in a m3 mixture), Coarse Aggregate (kg in a m3 mixture), Fine Aggregate (kg in a m3 mixture), and Age (Day (1~365)). Thus, the type of model that fits the dataset well is Multivariate Polynomial Regression with 8 Regressors.
6. Discussion / Conclusions
Using all independent variables, the linear model selected by Stepwise Regression has Root Mean Square Error (RMSE = 0.6339170), coefficient of determination (R-squared = 0.5973950), and Mean Absolute Error (MAE = 0.5039133). The third-order Multivariate Polynomial Regression with the same 8 Regressors reaches an adjusted R squared of about 0.909 with an F-statistic of 62.17.
Future study:
A future study will remove the problematic points and compare the model before and after to see what changes. It will also study the coefficients of the liquid regressors among the materials used in the Multivariate Polynomial Regression with 8 Regressors and how they affect Concrete Compressive Strength.
7. References
- Fly Ash Facts for Highway Engineers, web: https://www.fhwa.dot.gov/pavement/recycling/fach01.cfm, [Nov 25, 2019]
- IS-11, "Slag Cement and Fly Ash", Slag Cement Association, web: https://www.slagcement.org/aboutslagcement/is-11.aspx, [Nov 26, 2019]
- National Slag Association, "Blast Furnace Slag", web: http://www.nationalslag.org/blast-furnace-slag, [Nov 27, 2019]
- Ruwan Rajapakse, "Construction Engineering Design Calculations and Rules of Thumb", ScienceDirect, web: https://www.sciencedirect.com/topics/engineering/superplasticizer, [Nov 25, 2019]
- Concrete_Data.xls, "Concrete Compressive Strength Data Set", UC Irvine Machine Learning Repository, web: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength, [Nov 2, 2019]
- The Concrete Compressive Strength Data Set belongs to its original owner and donor, Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan 30067, R.O.C., e-mail: icyeh@chu.edu.tw, TEL: 886-3-5186511. Date donated: August 3, 2007.
---
title: "ECCS from UM and CA"
author: "Ahmed Alzahrani"
output:
flexdashboard::flex_dashboard:
theme: simplex
orientation: columns
social: ["facebook", "twitter", "linkedin"]
source_code: embed
---
```{r setup, include=FALSE}
# load necessary packages
library(ggplot2)
library(plotly)
library(plyr)
library(flexdashboard) ## you need this package to create dashboard
# read the concrete data set (adjust the path to your local copy of Concrete_Data.csv)
df <- read.csv("D:/General folder_for_all files/Department of Civil & Environmental Engineering & Engineering Mechanics/Fall_2019/Linear Model/project/Concrete_Data.csv")
```
```{r , include=FALSE}
# rename the eight regressors and the response (R indexes from 1, not 0)
colnames(df)[1:9] <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "Y")
```
```{r , include=FALSE}
df_new <- unique(df)
```
```{r , include=FALSE}
colnames(df_new)[1:9] <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "Y")
```
```{r , include=FALSE}
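# drop duplicated observations, then look for placeholder, missing, and zero-strength values
df_new <- unique(df)
index1 <- which(df_new == "?", arr.ind = TRUE)
index2 <- which(is.na(df_new), arr.ind = TRUE)
index3 <- which(df_new$Y == 0)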
```
A and I
=======================================================================
**Estimating Concrete Compressive Strength from Use Materials and the Concrete Age (ECCS from UM and CA)**
\
**Abstract**
\
This project uses Prof. I-Cheng Yeh's dataset (Concrete Data.xls), which contains more than 1000 observations of nine variables. Concrete Compressive Strength (MPa) is related to Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, and the age of the samples. The goal is to find an acceptable model that gives an equation for estimating Concrete Compressive Strength from the eight independent variables in the dataset. The dataset is scaled. Subset Selection Methods and Stepwise Regression both lead to using all regressors, which gives a higher Adjusted R squared and a lower Root Mean Square Error. A third-order Multivariate Polynomial Regression with eight regressors is fitted to the 1005 observations that are not duplicated. The fitted model is significant, with an adjusted R squared of 0.909 and an F-statistic of 62.17.
> # **2. Introduction**
---
\
> ####
Concrete Compressive Strength is one of the main tests performed on concrete samples. The test, which compresses concrete samples for education, research, or quality measurement, determines how strong hardened concrete samples are. The water-to-cement ratio is known to raise or lower the strength of concrete, so choosing an appropriate ratio is essential for making concrete with high compressive strength. Every decrease in the water-to-cement weight ratio leads to an increase in the compressive strength of concrete. However, a low ratio of water weight to cement weight causes poor workability, which means the concrete mixture is hard to mix. Plasticizers are used to improve the workability of a concrete mixture that is meant to contain less water in order to raise its compressive strength. Fly ash (Class F) and slag cement are used in concrete mixtures to protect the concrete from Alkali-Silica Reaction and sulfate attack (IS-11).

The dataset used in this project was donated by Prof. I-Cheng Yeh. It describes the components of concrete and contains nine variables with 1030 records each. The eight regressors (independent variables) are Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, and Fine Aggregate (each in kg in a m3 mixture), and Age (1 ~ 365 days). These quantitative independent variables influence concrete compressive strength (MPa). This is an opportunity to find a good model for the dataset and to obtain an equation from which the researcher can learn how the materials used and the age of the concrete (the number of days since the samples were made) determine the strength of the concrete samples.

The researcher seeks answers to these questions:
1- Which regressors should be included in the model?
2- What type of model fits the dataset well?
\
\
-----------------------------------------------------------------------
Methods
=======================================================================
\
```{r , include=FALSE}
df_new <- na.omit(df_new)
```
```{r, include=FALSE}
# standardize every column to mean 0 and standard deviation 1
df_new <- as.data.frame(scale(df_new[1:9]))
```
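One consequence of standardizing every column is worth making explicit. The following is a minimal sketch on simulated data (the vectors `x` and `y` are hypothetical stand-ins, not the concrete dataset): after `scale()`, the intercept of a simple regression is zero up to rounding, and its slope equals the Pearson correlation.

```r
# Simulated stand-ins for two columns of a dataset (assumption, not real data)
set.seed(1)
x <- rnorm(200, mean = 280, sd = 100)
y <- 0.1 * x + rnorm(200, sd = 15)

# Standardize both variables: subtract the mean, divide by the standard deviation
z <- as.data.frame(scale(data.frame(x = x, y = y)))

fit_z <- lm(y ~ x, data = z)

# The intercept is numerically zero; the slope matches the correlation
intercept <- unname(coef(fit_z)[1])
slope     <- unname(coef(fit_z)[2])
print(c(intercept = intercept, slope = slope, correlation = cor(x, y)))
```

This is why the fitted intercepts in the model summaries of this dashboard are of order 1e-16, and why the slope of `X1 ~ X8` (0.08635) agrees with the X1, X8 entry of the correlation matrix (0.08634782).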
> # a. Data
### The matrix of scatterplots for the data was examined
#### Matrix of scatterplots of Y, X1, X5, and X8
```{r }
pairs ( ~ Y + X1 + X5 + X8, df_new)
```
\
Column {.tabset data-width=450}
-----------------------------------------------------------------------
### Motivation
\
> ## I know that the ratio of water to cement has an impact on the compressive strength of concrete. \
> ## My motivation is to learn the effect of each independent variable in this model, including water and cement, \
> ## and how the regressor coefficients affect concrete compressive strength. \
\
### Information
\
> ### 1- **Cement** is the main element of concrete. Its job is to glue the other concrete components, such as sand and aggregate, together after mixing with water.
> ### 2- **Fine aggregates** are aggregates that pass through a 4.76 mm sieve.
> ### 3- The remaining aggregates, which do not pass through the sieve, are called **Coarse Aggregate**.
> ### 4- Water needs no definition.
> ### 5- **Super-plasticizers** are used to maintain high workability while at the same time maintaining strength (Ruwan Rajapakse).
> ### 6- "**Fly ash** is the finely divided residue that results from the combustion of pulverized coal and is transported from the combustion chamber by exhaust gases" (Fly Ash Facts for Highway Engineers).
> ### 7- "**Blast Furnace Slag** is formed when iron ore or iron pellets, coke and a flux (either limestone or dolomite) are melted together in a blast furnace. When the metallurgical smelting process is complete, the lime in the flux has been chemically combined with the aluminates and silicates of the ore and coke ash to form a non-metallic product called blast furnace slag" (National Slag Association).
> ### 8- **Age** of concrete: the number of days since the sample was made.
\
### Variables
\
> ### X1 : Cement (kg in a m3 mixture)
> ### X2 : Blast Furnace Slag (kg in a m3 mixture)
> ### X3 : Fly Ash (kg in a m3 mixture)
> ### X4 : Water (kg in a m3 mixture)
> ### X5 : Superplasticizer (kg in a m3 mixture)
> ### X6 : Coarse Aggregate (kg in a m3 mixture)
> ### X7 : Fine Aggregate (kg in a m3 mixture)
> ### X8 : Age (Day (1~365))
> ### Y : Concrete compressive strength (MPa)
\
### First
```{r, include= FALSE}
fitLY1 = lm(X1~X8, df_new)
fitPY1 = lm(X1~poly(X8, degree = 3 , raw=F), df_new)
```
\
> #### The scatterplot of X1 vs X8 suggests a possible curved relationship between X1 and X8, because of the points lying between X8 = 2 and X8 = 5.
\
```{r }
plot(df_new$X8, df_new$X1, col="black",cex.lab=1, xlab="X8 : Age (Day (1~365))",
ylab="X1 : Cement (kg in a m3 mixture)", main = "X1 vs X8")
abline (fitLY1, col="#0389ab", lwd = 1)
```
### First S
```{r }
summary(fitLY1)
summary(fitPY1)
```
### Second
> #### The scatterplot of X5 vs X1 suggests a possible curved relationship between X5 and X1.
\
```{r , include=FALSE }
fitL51 = lm(X5~X1, df_new)
fitP51 = lm(X5~poly(X1, degree = 3 , raw=F), df_new)
```
```{r }
plot(df_new$X1, df_new$X5, col="red",cex.lab=1, xlab="X1: Cement (kg in a cubic meter mixture)",
ylab="X5: Superplasticizer (kg in a cubic meter mixture)", main = "X5 vs X1")
abline (fitL51, col="#0389ab", lwd = 1)
```
### Second S
```{r }
summary(fitL51)
summary(fitP51)
```
### Third
```{r, include=FALSE }
fitLY8 = lm(Y~X8, df_new)
fitPY8 = lm(Y~poly(X8, degree = 3 , raw=F), df_new)
```
```{r }
plot(df_new$X8, df_new$Y, col="red",cex.lab=1, xlab="X8 : Age (Day (1~365))",
ylab="Y : Concrete compressive strength (MPa)", main = "Y vs X8")
abline (fitLY8, col="#0389ab", lwd = 1)
```
### Third S
```{r }
summary(fitLY8)
summary(fitPY8)
```
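A linear fit and a cubic fit in the same regressor are nested, so they can be compared formally rather than by eye. The following is a sketch on simulated data, assuming a hypothetical curved relationship (the variables `age` and `strength` are illustrative only and are not taken from the dataset):

```r
set.seed(2)
age <- runif(300, 1, 365)
# hypothetical curved relationship: strength rises quickly, then levels off
strength <- 10 * log(age) + rnorm(300, sd = 5)

fit_lin <- lm(strength ~ age)
fit_cub <- lm(strength ~ poly(age, degree = 3, raw = FALSE))

# Partial F-test: do the quadratic and cubic terms add explanatory power?
cmp <- anova(fit_lin, fit_cub)
print(cmp)

# Adjusted R-squared of both fits, side by side
adj_r2 <- c(linear = summary(fit_lin)$adj.r.squared,
            cubic  = summary(fit_cub)$adj.r.squared)
print(adj_r2)
```

For the dashboard's own `Y ~ X8` models, the reported adjusted R-squared rises from 0.1129 (linear) to 0.3213 (cubic), which is the kind of gap this comparison is meant to quantify.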
### Fourth
```{r, include=FALSE}
fitL18 = lm(X1~X8, df_new)
fitP18 = lm(X1~poly(X8, degree = 3 , raw=F), df_new)
```
```{r}
plot(df_new$X8, df_new$X1, col="#044715",cex.lab=1, xlab="X8 : Age (Day (1~365))",
ylab="X1 : Cement (kg in a m3 mixture)", main = "X1 vs X8")
abline (fitL18, col="#0389ab", lwd = 1)
```
### Fourth S
```{r }
summary(fitL18)
summary(fitP18)
```
### Fifth
```{r, include=FALSE}
fitL58 = lm(X5~X8, df_new)
fitP58 = lm(X5~poly(X8, degree = 3 , raw=F), df_new)
```
```{r}
plot(df_new$X8, df_new$X5, col="#044715",cex.lab=1, xlab="X8 : Age (Day (1~365))",
ylab="X5 : Superplasticizer (kg in a m3 mixture)", main = "X5 vs X8")
abline (fitL58, col="#0389ab", lwd = 1)
```
### Fifth S
```{r }
summary(fitL58)
summary(fitP58)
```
---
\
\
----------------------------------------------------------------------------------------------------------------------------------------------
D and A
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
Data and Analytic Methods
\
> ## **b. Data management**
- ### All column names were changed to X1, X2, X3 ~ X8, and Y.
- ### The dataset was checked for missing values and null values. None were found.
- ### The 25 duplicate observations were deleted.
- ### The data was scaled.
> ## **c. Analytic Methods**
- ### The matrix of scatterplots for the data was examined.
- ### The correlations of the regressors were examined.
- ### Subset Selection Methods and Stepwise Regression were used to determine which variables should be used.
- ### A third-order Multivariate Polynomial Regression model with eight regressors was fitted.
- ### The F-statistic and the adjusted R squared were checked.
- ### The adequacy of the model was investigated.
---
\
Column {data-width=500}
-----------------------------------------------------------------------
\
### **Correlation**
> #### The correlation matrix of the regressors shows that X5 and X4 have a relatively high correlation of -0.64694605, which indicates a possible collinearity issue.
```{r}
cor(df_new[,-9])
```
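Pairwise correlations only reveal collinearity between two regressors at a time. One common complement, which is not part of the original analysis, is the variance inflation factor (VIF); for a set of regressors it equals the diagonal of the inverse correlation matrix. The following is a minimal base-R sketch on simulated data (the columns `cement`, `water`, and `superplasticizer` are hypothetical stand-ins, not the real measurements):

```r
set.seed(3)
n <- 500
water <- rnorm(n)
# hypothetical negative dependence, loosely mimicking a correlation near -0.65
superplasticizer <- -0.65 * water + sqrt(1 - 0.65^2) * rnorm(n)
cement <- rnorm(n)
X <- data.frame(cement, water, superplasticizer)

R <- cor(X)
vif <- diag(solve(R))          # VIF_j = 1 / (1 - R_j^2)
names(vif) <- colnames(X)
print(round(vif, 3))

# Cross-check one entry against its definition:
# R_j^2 comes from regressing one regressor on all the others
r2_water <- summary(lm(water ~ cement + superplasticizer, data = X))$r.squared
print(1 / (1 - r2_water))
```

A VIF close to 1 means a regressor is nearly uncorrelated with the others; larger values mean its coefficient is estimated less precisely.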
---
\
### **Stepwise Regression**
```{r, include=FALSE }
library (leaps)
library (caret)
```
```{r, include=FALSE}
set.seed (2019)
train.control = trainControl(method ="cv", number = 10)
model.stepwise <- train(Y~., data =df_new, method ="leapSeq", tuneGrid = data.frame(nvmax = 1:8), trControl = train.control)
```
> #### Stepwise Regression suggests keeping all 8 regressors in the model, because the 8-regressor model has the lowest Root Mean Square Error (RMSE = 0.6339170), the highest coefficient of determination (R-squared = 0.5973950), and the lowest Mean Absolute Error (MAE = 0.5039133).
```{r }
model.stepwise$results
```
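The three metrics in this table can also be computed by hand. The sketch below, on simulated data with hypothetical column names, runs a 10-fold cross-validation for a single linear model and computes RMSE, R-squared, and MAE on each held-out fold. It assumes R-squared is taken as the squared correlation between predictions and observations, which is my understanding of caret's default and should be checked against the caret documentation.

```r
set.seed(2019)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 0.6 * dat$x1 - 0.4 * dat$x2 + rnorm(n, sd = 0.6)

k <- 10
fold <- sample(rep(1:k, length.out = n))   # random fold label for every row

# Fit on k-1 folds, evaluate on the held-out fold, one row of metrics per fold
metrics <- t(sapply(1:k, function(i) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  pred  <- predict(lm(y ~ x1 + x2, data = train), newdata = test)
  err   <- test$y - pred
  c(RMSE = sqrt(mean(err^2)),
    Rsquared = cor(pred, test$y)^2,
    MAE = mean(abs(err)))
}))

cv_summary <- colMeans(metrics)            # averages over the 10 folds
print(round(cv_summary, 4))
```

Because Y is standardized in this dashboard, the RMSE and MAE in the table are expressed in standard deviations of compressive strength rather than in MPa.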
---
\
### **Subset selection**
> #### Once the model size is chosen from the R-squared and RMSE of the Stepwise Regression, the subset selection output shows which variables to include in a model of that size.
```{r, include=FALSE}
fit.subsets = regsubsets(Y ~ ., data = df_new, nvmax = 8)
```
```{r }
summary(fit.subsets)
```
\
-------------------------------------------------------------------------
Polynomial
=======================================================================
\
> ## **Adjusted R squared**
> ### This section reports the Adjusted R squared of the third-order Multivariate Polynomial Regression with eight regressors. The Adjusted R squared is about 0.91, so about 91% of the total variation in Concrete compressive strength (MPa) can be explained by the third-order polynomial model using the eight regressors.
\
```{r, include = FALSE}
fit= lm(Y~ polym(X1,X2,X3,X4,X5,X6,X7,X8, degree = 3, raw = F) ,df_new)
```
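A full third-order polynomial in eight variables contains every monomial of total degree one to three, that is `choose(8 + 3, 3) - 1 = 164` terms, which matches the 164 model degrees of freedom reported by `anova(fit)`. A quick check on simulated data (the matrix `X` is a hypothetical stand-in for the eight regressors):

```r
set.seed(4)
X <- matrix(rnorm(60 * 8), ncol = 8)   # 60 simulated rows, 8 regressors

# poly() applied to a matrix builds the multivariate basis via polym()
basis <- poly(X, degree = 3, raw = FALSE)

n_terms  <- ncol(basis)
expected <- choose(8 + 3, 3) - 1       # monomials of total degree 1 to 3
print(c(n_terms = n_terms, expected = expected))
```

With 1005 observations, this leaves 1005 - 1 - 164 = 840 residual degrees of freedom, so the model is large relative to the sample and the adjusted R squared is the more informative summary.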
```{r }
print("Adjusted R squared of the third-order Multivariate Polynomial Regression with 8 regressors:")
summary(fit)$adj.r.squared
```
\
\
\
## **F statistic**
> #### $H_0: \beta_i=0,\ i=1,2,\dots,164$ versus $H_1:$ at least one $\beta_i\neq0$, where the $\beta_i$ are the coefficients of the 164 polynomial terms.
### **Based on the ANOVA output, the F-statistic is 62.166 and the p-value is less than 2.2e-16, which is below $\alpha$ = 0.05. We reject $H_0$ at the 0.05 level of significance. We have sufficient evidence that the polynomial model with Cement (X1, kg in a m3 mixture), Blast Furnace Slag (X2, kg in a m3 mixture), Fly Ash (X3, kg in a m3 mixture), Water (X4, kg in a m3 mixture), Superplasticizer (X5, kg in a m3 mixture), Coarse Aggregate (X6, kg in a m3 mixture), Fine Aggregate (X7, kg in a m3 mixture), and Age (X8, Day (1~365)) is better than just using the mean Concrete compressive strength (Y, MPa).**
Column {data-width=800}
-----------------------------------------------------------------------
```{r }
anova(fit)
```
---
\
-----------------------------------------------------------------------
Diagnostics
=======================================================================
Column {.tabset data-width=650}
-----------------------------------------------------------------------
```{r, include=F}
library(plotly)
library(ggplot2)
```
```{r}
# Obtain the values needed for the diagnostic plots
# Extract fitted values
Fitted.Values <- fit$fitted.values
# Extract residuals
Residuals <- fit$residuals
# Scale the residuals to unit variance (rstandard(fit) would also adjust for leverage)
Standardized.Residuals <- as.numeric(scale(fit$residuals))
# Theoretical normal quantiles for the Q-Q plot
Theoretical.Quantiles <- qqnorm(Residuals, plot.it = FALSE)$x
# Square root of the absolute standardized residuals (for the Scale-Location plot)
Root.Residuals <- sqrt(abs(Standardized.Residuals))
# Calculate leverage (hat values)
Leverage <- lm.influence(fit)$hat
# Create data frame
# Will be used as input to plot_ly
diagnostics <- data.frame(Fitted.Values,
Residuals,
Standardized.Residuals,
Theoretical.Quantiles,
Root.Residuals,
Leverage)
```
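The chunk above standardizes residuals with `scale()`, which gives plain z-scores. An alternative, shown here only as a sketch, is the leverage-adjusted (internally studentized) residual that `rstandard()` returns. The example assumes the built-in `mtcars` data and a hypothetical `toy_fit`. The two versions can differ noticeably for high-leverage points.

```r
# Built-in mtcars data stands in for df_new
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

e <- residuals(toy_fit)
h <- hatvalues(toy_fit)        # leverage of each observation
s <- summary(toy_fit)$sigma    # residual standard error

# Plain z-scores, as scale() produces: leverage is ignored
z_scores <- as.numeric(scale(e))

# Internally studentized residuals: e_i / (s * sqrt(1 - h_ii))
std_manual  <- e / (s * sqrt(1 - h))
std_builtin <- rstandard(toy_fit)
```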
### Linearity
```{r}
m <- list(
l = 100,
r = 100,
b = 100,
t = 100,
pad = 4
)
# Fitted vs Residuals
p1 <- plot_ly(diagnostics, x = Fitted.Values, y = Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "df_new",
marker = list(size = 10, opacity = 0.5))%>%
layout(title = "Residuals vs Fitted Values",
xaxis = list(title="Fitted Values", font=list(size=14)),
yaxis = list(title="Residuals", font=list(size=14)),
plot_bgcolor = "#e6e6e6",
font=list(size=14), margin=m)
ggplotly(p1)
```
Column {data-width=350}
-----------------------------------------------------------------------
\
\
\
**1. Linearity Assumption**
We check the linearity assumption with the residuals vs fitted values plot. Most of the blue points lie between -0.5 and 0.5 on the residual axis and between -2 and 2 on the fitted-value axis. There is no particular pattern between the residuals and the fitted values, so the linearity assumption appears reasonable.
**2. Normality Assumption**
The normality assumption can be checked with the Q-Q plot in Diagnostics 2. Some blue points do not fall on the orange reference line, so the histogram of the residuals should be examined to decide whether these departures can be tolerated and whether the normality assumption is reasonable.
**3. Histogram of fitted residuals**
The residuals appear to be approximately normally distributed because their distribution is unimodal and roughly symmetric.
**4. Equal Variance**
The equal variance assumption can be checked with the Scale-Location plot in Diagnostics 2. In that plot the residuals are randomly scattered with no visible pattern, which indicates that the equal variance assumption is not violated.
**5. Residuals vs Leverage**
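No point reaches a leverage of 1.0, so no single observation completely determines its own fitted value. Because leverage can never exceed 1, a rule-of-thumb cutoff such as twice the average leverage gives a stricter check.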
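
The leverage comment above can be backed by a numeric check. The sketch below flags observations whose leverage exceeds twice the average, a common rule of thumb and not necessarily the criterion the author had in mind. It assumes the built-in `mtcars` data and a hypothetical `toy_fit`.

```r
# Built-in mtcars data stands in for df_new
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(toy_fit)
n <- length(h)
p <- length(coef(toy_fit))   # number of parameters, including the intercept

# Leverages lie in [0, 1] and sum to p, so the average leverage is p / n
cutoff <- 2 * p / n

# Names of observations above the rule-of-thumb cutoff (may be empty)
high_leverage <- names(h)[h > cutoff]
```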
Diagnostics 2
=======================================================================
Column
-----------------------------------------------------------------------
### 2. Normality Assumption
```{r}
# Q-Q plot with a y = x reference line
p2 <- plot_ly(diagnostics, x = Theoretical.Quantiles, y = Standardized.Residuals, type = "scatter", mode = "markers", hoverinfo = "x+y", name = "df_new", marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
add_trace(x = Theoretical.Quantiles, y = Theoretical.Quantiles, type = "scatter", mode = "lines", name = "", line = list(width = 2))%>%
layout(title = "Q-Q Plot", plot_bgcolor = "#e6e6e6",
xaxis = list(title="Theoretical Quantiles", font=list(size=14)),
yaxis = list(title="Standardized Residuals", font=list(size=14)), font=list(size=14), margin=m)
ggplotly(p2)
```
### 3. Histogram of fitted residuals: the residuals appear to be approximately normal because their distribution is unimodal and roughly symmetric.
```{r}
# Histogram of the residuals on the density scale, with a kernel density overlay
df5 <- data.frame(Residuals = fit$residuals)
p <- ggplot(df5, aes(x = Residuals)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, alpha = 0.7, fill = "#333333") +
geom_density(fill = "#ff4d4d", alpha = 0.5) + theme(panel.background = element_rect(fill = '#ffffff')) + ggtitle("Density with Histogram overlay")
ggplotly(p)
```
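A formal test can complement the visual judgement of the histogram. The sketch below applies the Shapiro-Wilk test to the residuals of a hypothetical `toy_fit` on the built-in `mtcars` data. It is an illustration only, and with around a thousand observations such a test may flag even small departures from normality.

```r
# Built-in mtcars data stands in for df_new
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution
sw <- shapiro.test(residuals(toy_fit))

w_statistic <- unname(sw$statistic)   # W close to 1 is consistent with normality
p_value     <- sw$p.value             # a small p-value is evidence against normality
```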
Column
-----------------------------------------------------------------------
### 4. Equal Variance
```{r}
# Scale Location
p3 <- plot_ly(diagnostics, x = Fitted.Values, y = Root.Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "df_new",
marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
layout(title = "Scale-Location", plot_bgcolor = "#e6e6e6", xaxis = list(title="Fitted Values", font=list(size=14)),
yaxis = list(title="sqrt(|Standardized Residuals|)", font=list(size=14)), font=list(size=14), margin=m)
ggplotly(p3)
```
### 5. Residuals vs Leverage
```{r}
s <- loess.smooth(Leverage, Residuals)
p4 <- plot_ly(diagnostics, x = Leverage, y = Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "df_new", marker = list(size = 10, opacity = 0.5), showlegend = F) %>%
add_trace(x = s$x, y = s$y, type = "scatter", mode = "lines", name = "Smooth", line = list(width = 2)) %>%
layout(title = "Residuals vs Leverage", plot_bgcolor = "#e6e6e6", xaxis = list(title="Leverage", font=list(size=14)),
yaxis = list(title="Residuals", font=list(size=14)), font=list(size=14), margin=m)
ggplotly(p4)
```
-------------------------------------------------------------------------
Results and Conclusions
=======================================================================
\
> ## **5. Results**
> The **answers** to the questions the researcher set out to address:
\
> **1. Which regressors are suggested for the model?**
> According to **Stepwise Regression and Subset Selection**, using all regressors is recommended: the use materials [Cement (X1), Blast Furnace Slag (X2), Fly Ash (X3), Water (X4), Superplasticizer (X5), Coarse Aggregate (X6), Fine Aggregate (X7)] and the concrete age [Age (X8)]. This model has the lowest Root Mean Square Error (RMSE = 0.6339170), the highest coefficient of determination (R-squared = 0.5973950), and the lowest Mean Absolute Error (MAE = 0.5039133).
> **2. What type of model fits the dataset well?**
> According to the diagnostics of the third-order Multivariate Polynomial Regression with 8 regressors, there is no issue with using this model to estimate Concrete compressive strength (MPa) from Cement (kg in a m3 mixture), Blast Furnace Slag (kg in a m3 mixture), Fly Ash (kg in a m3 mixture), Water (kg in a m3 mixture), Superplasticizer (kg in a m3 mixture), Coarse Aggregate (kg in a m3 mixture), Fine Aggregate (kg in a m3 mixture), and Age (Day (1~365)). Thus, the type of model that fits the dataset well is Multivariate Polynomial Regression with 8 Regressors.
> ## **6. Discussion / Conclusions**
> Using all independent variables in the model, Multivariate Polynomial Regression with 8 Regressors has a Root Mean Square Error of 0.6339170, a coefficient of determination (R-squared) of 0.5973950, and a Mean Absolute Error of 0.5039133.
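
The three metrics quoted above can be computed directly from observed and predicted values. The sketch below shows the definitions on the built-in `mtcars` data with a hypothetical `toy_fit`. It uses in-sample predictions, which is an assumption: the source does not state whether its values came from in-sample fits or from resampling.

```r
# Built-in mtcars data stands in for df_new
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

observed  <- mtcars$mpg
predicted <- fitted(toy_fit)
errors    <- observed - predicted

rmse <- sqrt(mean(errors^2))   # Root Mean Square Error
mae  <- mean(abs(errors))      # Mean Absolute Error
r2   <- 1 - sum(errors^2) / sum((observed - mean(observed))^2)   # R-squared
```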
> ## **Future study:**
> Future work will remove the problematic points and compare the model before and after removal to see what changes.
> It will also study the coefficients of the liquid regressors among the use materials in the Multivariate Polynomial Regression with 8 Regressors and how they affect Concrete Compressive Strength.
> ## **7. References**
>- Fly Ash Facts for Highway Engineers, web: https://www.fhwa.dot.gov/pavement/recycling/fach01.cfm, [Nov 25, 2019]
>- IS-11, "Slag Cement and Fly Ash", Slag Cement Association, web: https://www.slagcement.org/aboutslagcement/is-11.aspx, [Nov 26, 2019]
>- National Slag Association, "Blast Furnace Slag", web: http://www.nationalslag.org/blast-furnace-slag, [Nov 27, 2019]
>- Ruwan Rajapakse, "Construction Engineering Design Calculations and Rules of Thumb", ScienceDirect, web: https://www.sciencedirect.com/topics/engineering/superplasticizer, [Nov 25, 2019]
>- Concrete_Data.xls, "Concrete Compressive Strength Data Set", UC Irvine Machine Learning Repository, web: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength, [Nov 2, 2019]
>
> The Concrete Compressive Strength Data Set belongs to its original owner and donor, Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan 30067, R.O.C., e-mail: icyeh@chu.edu.tw, TEL: 886-3-5186511. Date donated: August 3, 2007.